An Evaluation of the Concept Retrieval Annotation for Spanish-English CLEFER Parallel Corpora
Authors
Abstract
This paper studies the use of the concept retrieval annotation (CRA) method on parallel corpora. CRA treats concepts as documents and text chunks as queries [1]; the concepts with the highest similarity to a text chunk are considered for generating the final semantic annotation. CRA relies on an existing knowledge resource (KR), from which lexicons are extracted to perform the semantic annotation. Until now, CRA has been applied to monolingual scenarios, showing good performance over both very large collections (e.g., CALBCII-SSC) and very large lexicons (e.g., UMLS® [2]). We have also applied this semantic annotator to different tasks in biomedicine, such as resource discovery [3], relation extraction [4], and scientific bibliography analysis [5]. In this work, we apply CRA in a bilingual scenario, using the English and Spanish lexicons provided at the CLEFER workshop. In this extended abstract, we first summarize the main features of CRA as a cross-lingual annotator and then present the results obtained over the two provided parallel corpora, EMEA and MEDLINE®.
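The idea of treating concepts as documents and text chunks as queries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the concept IDs, the `annotate` function, the bag-of-words cosine similarity, and the threshold are hypothetical and are not the scoring used by CRA itself.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: concept ID -> lexical variants in English and Spanish.
# Real CRA lexicons are extracted from a knowledge resource such as UMLS.
LEXICON = {
    "C0001": ["heart attack", "myocardial infarction", "infarto de miocardio"],
    "C0002": ["headache", "dolor de cabeza"],
    "C0003": ["heart failure", "insuficiencia cardiaca"],
}


def tokens(text):
    """Lowercased whitespace tokens as a bag of words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def annotate(chunk, lexicon=LEXICON, threshold=0.6):
    """Use the text chunk as a query and rank concepts (the 'documents').

    Each concept is scored by its best-matching lexical variant; concepts
    scoring below the (assumed) threshold are dropped.
    """
    query = tokens(chunk)
    scored = []
    for concept_id, variants in lexicon.items():
        best = max(cosine(query, tokens(v)) for v in variants)
        if best >= threshold:
            scored.append((concept_id, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Parallel chunks in English and Spanish should retrieve the same concept.
    print(annotate("heart attack"))
    print(annotate("infarto de miocardio"))
```

Because both language variants are attached to the same concept ID, parallel English and Spanish chunks retrieve the same concept in this sketch, which is the property a cross-lingual annotator needs when run over parallel corpora.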
Related resources
Unsupervised Disambiguation for a Multilingual Medical Information System using UMLS
This paper describes techniques for unsupervised word sense disambiguation of English and German medical documents using the Unified Medical Language System (UMLS). We present both monolingual techniques which rely only on the structure of UMLS, and bilingual techniques which also rely on the availability of parallel corpora. The best results are obtained using relationships between terms given...
The TDT-3 Text and Speech Corpus
The TDT-3 Text and Speech Corpus expands on previous phases of Topic Detection and Tracking data collections, by increasing the number of news sources being sampled, by including Mandarin Chinese as well as English news data, and by introducing new forms of topic annotation. In order to satisfy the specific data and annotation requirements of the TDT-3 Evaluation Plan[1], the LDC refined and su...
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC
OBJECTIVE To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified M...
Interlingual Annotation of Parallel Text Corpora: A New Framework for Annotation and Evaluation
This paper focuses on the next step in the creation of a system of meaning representation and the development of semantically-annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to provide parallel corpora annotated with detailed deep ...
Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System
We present a method for the automated acquisition of a multilingual medical lexicon (for Spanish and Swedish) to be used within the framework of a medical cross-language text retrieval system. We incorporate seed lexicons and parallel corpora derived from the UMLS Metathesaurus. The seed lexicons for Spanish and Swedish are automatically generated from (previously manually constructed) Portugue...
Publication date: 2013